Graphic processors to speed-up simulations 
for the design of high performance solar receptors 
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Abstract 

Graphics Processing Units (GPUs) are now powerful 
and flexible systems adapted and used for other purposes 
than graphics calculations ( General Purpose computation 
on GPU — GPGPU). We present here a prototype to be 
integrated into simulation codes that estimate temperature, 
velocity and pressure to design next generations of solar 
receptors. Such codes will delegate to our contribution 
on GPUs the computation of heat transfers due to radia- 
tions. We use Monte-Carlo line-by-line ray-tracing through 
finite volumes. This means data-parallel arithmetic trans- 
formations on large data structures. Our prototype is in- 
spired on the source code of GPUBench. Our performances 
on two recent graphics cards (Nvidia 7800GTX and ATI 
RX1800XL) show some speed-up higher than 400 compared 
to CPU implementations leaving most of CPU computing 
resources available. As there were some questions pending 
about the accuracy of the operators implemented in GPUs, 
we start this report with a survey and some contributed tests 
on the various floating point units available on GPUs. 



1. Introduction 

Graphics Processing Units (GPU) offer computing re- 
sources higher than the ones available on processors 
0. With the delivery of the latest generations of 
GPUs, they can be used for general processing (GPGPU, 
|www . gpgpu . org) l (7) and become application specific 
co-processors for regular and heavily data-parallel process- 
ing. 

We strongly believe that the development of GPGPU will 
necessary pass through the identification of key applications 
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that will benefit from the various hardwired functionalities 
available on GPU. We describe the architecture of GPUs 
and properties of the implemented floating point arithmetic 
discovered with our tests in Section [2] Section [3] presents 
the accurate estimation of radiative heat transfers due to the 
filtering of incidental lines and the generation of heat in- 
duced lines. We elaborate on the performances of our pro- 
totype and we present perspectives in Section [4] We do 
not account for diffusion in this preliminary study as our 
medium does not contain particles. 

To the best of our knowledge, there is no prior art in the 
implementation of the tasks reported here on GPUs. Monte- 
Carlo ray-tracing and line-by-line analysis are routinely per- 
formed on CPUs for simulations of radiative heat transfers 
though these tasks usually saturate CPUs leaving no oppor- 
tunity to the coupling of convective and radiative phenom- 
ena on real applications. Other applications heavily rely on 
elaborate physical models f5] HQ. Most simulations are cur- 
rently performed for simple reference cases (isothermal gas 
column at equilibrium). The description of gas spectrum is 
generally simplified in calculation with engineering inter- 
ests leading to errors in the range of 5-15% 

2. Graphics Processing Units (GPU) 

GPUs handle mostly geometrical objects and pixels. Im- 
ages are created by applying geometrical transformations to 
vertices and by splitting objects into pixels. Calculations 
are carried out by various stages composing the graphics 
pipeline presented in Figure Q] Actual pipelines of existing 
GPUs differ slightly. Manufacturers move, share, duplicate 
or add resources depending on boards and processors. The 
figure shows the various stages on the example of a trian- 
gle. In this example, vertex shaders treat 3 vertices whereas 
pixel shaders treat 17 pixels. For most geometrical objects, 
the number of pixels is larger than the number of vertices. 
Modern architectures contain more pixel shaders than ver- 
tex shaders. The current ratio is commonly 24 against 8. 



Figure 1. Model of the graphics pipeline. 



Figure 3. Pixel shader of the Nvidia 7800GTX. 
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Figure 2. Vertex shader of the Nvidia 
7800GTX. 



The host sends vertices to position primitive geometrical 
objects (points, lines, polygons). Objects are transformed 
(rotation, translation, illumination. . . ) and assembled to 
create more elaborate objects. These operations are carried 
out by vertex shaders. 

At each cycle, each vertex shader (see Figure [2] adapted 
from [8 1) is able to initiate a Multiply and Add (MAD) op- 
eration on 4 pieces of data in the vector unit and a special 
operation in the scalar unit. The implemented special op- 
erations are exponential functions (exp, log), trigonomet- 
ric functions (sin, cos) and reciprocal functions (1/x and 
1/y/x). Since hardware support of DirectX 9.0, vertex 
shaders are able to address texture memory through a dedi- 
cated unit. 

The first floating point unit of each pixel shader (see Fig- 
ure[3]adapted from @) carries out 4 MADs or an access to 
texture via a dedicated unit. The result is then sent to the 
second floating unit which carries out 4 MADs. In the case 
of Nvidia 7800 GTX, each pixel shader includes a level 1 
texture cache. 

Table Q] presents the floating point formats implemented 
on GPUs and CPUs. A number is represented by its man- 
tissa, its exponent e and its sign bit s. The first bit of the 
mantissa (left of the fraction point) can be set to 1 un- 



less the number to be represented is very small. The re- 
maining bits form the fraction /. A normal representa- 
tion stores (— l) s ■ l.f ■ 2 e and a subnormal one stores 
(— l) s • 0./ ■ 2 6min where e m i n is the minimum allowed 
exponent. Single precision (32 bit) became available on 
GPUs with Shader Model 3.0. Manufacturers of GPUs do 
not claim full compatibility with ANSI-IEEE standard on 
floating point arithmetic. 

Before porting our application to GPUs, we surveyed 
two pieces of software testing performances and implemen- 
tations of floating point arithmetic on Nvidia 7800GTX and 
ATI RX1800XL |T]|4). Tests have drawn the first following 
conclusions: 

• Additions and multiplications are truncated. 

• Subtractions seem to benefit from a guard bit with 
Nvidia but not with ATI. 

• Multiplications attain faithful rounding. 

• Errors on divisions indicate that divisions are based on 
multiplications by approximations of the reciprocal. 

We wrote additional test vectors summarized in Table [2] 
where ©, 0, <g> are the addition, subtraction and multiplica- 
tion operators implemented on GPU. U[a, b) are uniformly 
distributed random variables on [a, b). Random tests are 
performed on 2 23 inputs, other tests are exhaustive. 

We used OpenGL primitives and stored data in textures 
using Frame Buffer Object and respectively texRECT and 
tex2D for Nvidia and ATI chips. We set up vertex and 
pixel shaders for computation with the OpenGL shading 
language. We ran these tests on Nvidia 7800 GTX with 
driver ForceWare 81.98 and on ATI RX1800XL with driver 
Catalyst 6.3. 

On some architectures, internal registers store numbers 
with a precision higher than the one used in memory or with 
a larger dynamics for the exponents. Sometimes MADs 
maintain larger accumulators or round results only once, af- 
ter the addition. Our tests showed that no such things occur 
on GPUs but they revealed a surprising feature. It appears 
that the second pixel shader floating point unit on ATI and 



Table 1. Representation format of floating point numbers on GPUs and CPUs. 



Reference 


Number of bits 


Non numerical 




Total 


Sign 


Exponent 


Fraction 


values 


Nvidia 


16 




5 


10 


NaN, Inf 




32 




8 


23 


(as documented in 12) 


ATI 


16 




5 


10 


Not implemented 




24 




7 


16 






32 




8 


23 


Not documented 


ANSI-IEEE 754 Hi] 


32 




8 


23 


NaN, Inf 




64 




11 


52 





Table 2. Arithmetic experimentations and results. 
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ATI-Pixel 


X \ I ^ ZO ' z 

z = 24 — ► -2~ 23 
25 < i — > 


Nvidia-Pixel 


1 < i < 23 — > -2~ % 
24 < i < 25 — > -2~ 23 
26 < i — > 


x®y + (±x) ® (TV) 


All 
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ATI-Pixel 
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Nvidia-Pixel 


i < 23 • 2 iY — > correct 


ATI-Vertex 


z < 2^ 3 — > correct 


Nvidia- Vertex 


i < 2 1M — > correct 


x <g> y — x x y 


ATI-Pixel 


x e [1, 2) A y e [1, 2/x) — > {-1.00031 ulp • • • 0.00215 ulp} 
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Nvidia-Pixel 
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both units on Nvidia produce a mantissa with one extra bit. 
This extra bit forces modifications of some multiple preci- 
sion tools 131 and we conjecture that it is implemented for 
backward compatibility. 

Fast small multipliers usually ignore partial products be- 
low a given threshold and add a constant to correct the in- 
troduced statistical bias 1101 . Results lead us to think that 
this constant is 2~ 35 on ATI and 41 • 2~ 30 on Nvidia. The 
multipliers accumulate partial products on 9 extra rows on 
ATI and 6 extra rows on Nvidia. These figures do not in- 
clude the extra bit left of the mantissa. Other tests indicate 
that multipliers use radix 2 sign-magnitude logic internally. 

Additional tests showed that subnormal numbers are re- 
placed by during transfers even when no arithmetic op- 
eration is performed on GPUs meaning that drivers proba- 
bly perform arithmetic operations on textures. Non numer- 
ical quantities are not modified except that sNaN {signaling 
NaN) is changed to qNaN (quiet NaN) on ATI. 

3. Monte- Carlo line-by-line ray tracing 

The experimental setting is presented in Figure [4] This 
device produces electricity from sunlight concentrated by a 
large reflector. Concentrated sunlight is used to heat a metal 
pipe that transfers heat through contact and infra-red radi- 
ations. The goal is to transfer as much energy as possible 
to the turbine. Dynamic and thermal phenomena are intri- 
cately interwoven as air temperature raises. 

Though our approach is based on finite volumes used 
for example by Fluent dwww . fluent . coml) and Trio- 
U ( |www-trio-u . cea . f r) , this work can also be 
applied to accurately instantiate source terms in soft- 
ware based on finite element methods such as ComSol 
dwww . comsol . f rl l. 

Combined optical depth r of infrared participating gases 
CO2 and H2O (O2 and N2 are ignored) represents millions 
of lines that are functions of temperature T, pressure p, and 
density of absorbing molecule u g in the following formulas 



copied from J5] Annex A. 2] with the same notations. 



S W (T) Q(T ref ) e-^fc 




t{u) = ^2,u g ^2 S m >(T)f{v- v m ,) (2) 
lout = / fa e- T ^+/(^r)(l-e- T ^)(3) 



I{u,T) = 




The first formula provides a ratio S vrj ' (T) /S vv > (T re f) for 
the intensity of the line due to transition between lower 
and upper states r\ and rj' of component gas g centered on 
wavenumber v vv '. This ratio is applied to the 16 contribu- 
tions of this line in the wavelength space around v m i . Once 
this transformation is performed for all the lines of all the 
gases, the contributions are cumulated to obtain t(v) for 
all the considered wavenumbers v. We apply Beer-Lamber 
law for absorption (first term of 7 0Ut ) and Planck law for 
heat induced emissions (second term of 7 out ) for a ray pass- 
ing through length I of an isothermal homogeneous finite 
volume of Figure [5] GPUs handle all data-parallel compu- 
tations and Listing [TJpresents how formulas (|2}, (01 and (|4|i 
are implemented. 

After the GPU has computed 7 out for all the considered 
wavenumbers the power of the total heat transfer is obtained 
by summing /;„ — 7 0Ut of up to 2 24 w 16 ■ 10 6 values stored 
in a 2 dimensional square matrix. This task requires to sum 
all the data of a texture and we used a parallel reduction 
scheme adapted to GPUs |8 |. The sum is evaluated with an 
iterative algorithm where each iteration sums of 4 pieces of 
data from the previous iteration. 

Integrations with respect to space in the simulation of 
non-isothermal flows is obtained by Monte-Carlo line-by- 
line ray-tracing paradigm as presented Figure [5] The main 



Random ray \ (direction and origin) 



Listing 1. Parallel evaluation of (2)-(4\ 

! ! ARBfpl . 



# srat io_g { 1-2 } = S vv i (T)/S vv i (T ref ) are computed 

# in the omitted part from 0) 

# and stored in a texture 

TEX sratio_gl, sig_coords_l, texture [4], RECT; 
TEX sratio_g2, sig_coords_2 , texture [4], RECT; 

# tref_g { 1-2 } = u g S vv i (T Te f )f(v — i/,,„i) are constant 

# textures computed on CPU and trans fered to 

# GPU memory on program initialization 

TEX tref_gl, vnu_coords_l , texture [5], RECT; 
TEX tref_g2, vnu_coords_2 , texture [5], RECT; 

# Suming up the contributions of the two gases 
MUL tau, sratio_gl, tref_gl; 

MAD tau, sratio_g2, tref_g2, tau; 

MUL tau, tau, 11; 

# 11 = -Z/M 2 ) i s a scalar 

# set for each iteration 

# Factors $l/ln(2)$ are introduced as GPUs 

# currently only support base-2 exponentials 

# Special functions need 4 invocations 
EX2 exp_tau_l.x, tau.x; 

EX2 exp_tau_l.y, tau.y; 
EX2 exp_tau_l.z, tau.z; 
EX2 exp_tau_l.w, tau.w; 

MUL exponent, c2T, nu; 

# c2T = c 2 /(Tln(2)) is a scalar 

# set for each iteration 

EX2 den.x, exponent. x; 

EX2 den.y, exponent. y; 

EX2 den . z , exponent. z; 

EX2 den.w, exponent. w; 

SUB den, den, (1, 1, 1, 1 } ; 

RCP inv.x, den.x; 

RCP inv.y, den.y; 

RCP inv.z, den.z; 

RCP inv.w, den.w; 

MUL nu3, nu, nu; 
MUL nu3, nu3, nu; 
MUL nu3, nu3, hc2; 

# hc2 = 2h/c is a constant scalar 



MUL factorl, inv, nu3; 

SUB factor2, one, exp_tau_l; 

MUL term, i_in, exp_tau_l; 

# i_in = /i n is the Tout texture of 

# tiie previous iteration 

# Return result . color = /out as the color of 

# tiie pixel to be written 

MAD result . color, factorl, factor2, term; 
END 
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Figure 5. Monte-Carlo ray-tracing. 



simulation code on CPU directs this process and averages 
the effect of individual rays. 

We designed a program in two parts. The first part is 
executed by the CPU and represents 3500 lines of C++ 
code and OpenGL directives. The second part is executed 
by the pixel shaders of the GPU and represents 250 lines 
of OpenGL shading primitives (ARB fragment program). 
Listing[T]is extracted from these 250 lines and corresponds 
to the evaluation of formulas (0, ([3]) and In addition, we 
wrote the same program in C to measure the benefit of the 
use of a GPU. This code was compiled with Microsoft Vi- 
sual C++ 2005, optimizing for speed (/Ox /arch:SSE). 
We ran both programs on a Pentium 4 system with 1 GB 
of DDR2 memory and with a Nvidia 7800GTX and an ATI 
RX1800XL graphic card both with 256 MB of GDDR3. 

We measured the number of lines evaluated per second 
depending on the number of lines per ray. The results are 
plotted in Figure|6]with logarithmic axes and show a speed- 
up as high as 420 compared to CPU. Timing is done on 100 
rays treated sequentially. GPU performance loss around 10 6 
lines per ray is due to data too large to fit in graphic memory 
and should disappear with newer GPU boards. 

This impressive speed-up is partially due to the ability of 
GPUs to perform many complex operations per cycle. Each 
pixel shader can start one exponential per cycle thanks to 
dedicated hardware. As there are up to 24 shaders, 24 ex- 
ponentials are initiated at 486 Mhz leading to 13.2 10 9 ex- 
ponentials per second. On CPUs, exponential functions are 
evaluated in software or in micro-code and require typically 
100 cycles to complete. On a 3 Ghz Pentium 4 this means 
about 30 10 6 exponentials per second. The second reason 
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Figure 6. Number of lines treated by second. 

for our speed-up lies in the fact that GPUs and drivers ex- 
ploit regularity in the code to hide memory latency and ex- 
ecute floating point operations in parallel in pixel shaders. 

4. Conclusion and perspectives 

We started this report with test vectors aimed at the char- 
acterization of the floating point operators of GPUs. We 
showed that: 

• Temporary results are computed to 32 bit format. 

• Multipliers use constants to compensate for discarded 
partial products. 

• Some Nvidia and ATI adders use an extra bit. 

We will certainly set up more test vectors as we continue 
working on GPUs. Up to date tests are available from the 
authors upon request by email. 

We accelerated the computation of radiation properties 
in order to simulate precisely, i.e. using line-by-line spec- 
tra of gases. Common speed-up brought by GPU start at 5 
and may climb to 50 as some developments in the industry 
are claiming Our GPU implementation is 400 times faster 
than CPU evaluation. This performance almost preserves 
the computing resource available on CPU as we noticed 
a runtime increase below 1% program saturates our CPU 
and GPU compared to the same program with no request to 
GPU. 

These figures where obtained using a fixed number (16) 
of points of evaluation for each line. Our next version will 
dynamically adapt the number of points depending on the 
local temperature and the intensity of the line. This tasks 
will involve vertex shaders and blending units. Blending 
units starting with Nvidia 8800 operate on 32 bit floating 

^ee jhttp : / /www . emp hotonics . com/£ast£dtd.htmT| 



point data. Work on radiosity will be performed only if dis- 
crepancies between simulations and experimentations show 
that the effect of diffusion cannot be ignored. 

The impressive speed-up reported here was due to the 
large number of lines for one single ray-tracing leading to 
■d a huge amount of data parallel transformations. Similar 
o. speed-ups may be obtained for other settings. One possi- 
ble application of GPGPU with connection to the industry, 
is to speed-up simulations of elaborate surfaces of planar 
solar receptors. Software will average spectral effects to 
two bands of wavelength (infrared and visible) but it will 
consider a large number of independent rays to accurately 
account for anisotropic reflections and absorptions. 

As we are building know-how on porting simulations for 
thermal sciences to GPUs we will explore automatic tools 
and build libraries of techniques to efficiently reuse parts of 
our developments. 
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